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Abstract 

Nonparametric methods are widely applicable to statistical inference problems, since they rely on 
a few modeling assumptions. In this context, the fresh look advocated here permeates benefits from 
variable selection and compressive sampling, to robustify nonparametric regression against outliers - that 
is, data markedly deviating from the postulated models. A variational counterpart to least-trimmed squares 
regression is shown closely related to an £o-(pseudo)norm-regularized estimator, that encourages sparsity 
in a vector explicitly modeling the outliers. This connection suggests efficient solvers based on convex 
relaxation, which lead naturally to a variational M-type estimator equivalent to the least-absolute shrinkage 
and selection operator (Lasso). Outliers are identified by judiciously tuning regularization parameters, 
which amounts to controlling the sparsity of the outlier vector along the whole wbustification path of 
Lasso solutions. Reduced bias and enhanced generalization capability are attractive features of an improved 
estimator obtained after replacing the ^o-(pseudo)norm with a nonconvex surrogate. The novel robust 
sphne-based smoother is adopted to cleanse load curve data, a key task aiding operational decisions in 
the envisioned smart grid system. Computer simulations and tests on real load curve data corroborate the 
effectiveness of the novel sparsity-controlling robust estimators. 
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I. Introduction 

Consider the classical problem of function estimation, in which an input vector x := [xi, . . . , Xp]' G W 
is given, and the goal is to predict the real-valued scalar response y = /(x). Function / is unknown, 
to be estimated from a training data set T := {yi,^i}j^i- When / is assumed to be a member of a 
finitely-parameterized family of functions, standard (non-)linear regression techniques can be adopted. If 
on the other hand, one is only willing to assume that / belongs to a (possibly infinite dimensional) space 
of "smooth" functions T-L, then a nonparametric approach is in order, and this will be the focus of this 
work. 

Without further constraints beyond / G functional estimation from finite data is an ill-posed problem. 
To bypass this challenge, the problem is typically solved by minimizing appropriately regularized criteria, 
allowing one to control model complexity; see, e.g., llT2l . |[34l . It is then further assumed that T-L has the 
structure of a reproducing kernel Hilbert space (RKHS), with corresponding positive definite reproducing 
kernel function K{-,-) : W xW and norm denoted by || • Under the formalism of regularization 

networks, one seeks / as the solution to the variational problem 
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(1) 



where V{-) is a convex loss function, and /x > controls complexity by weighting the effect of the 
smoothness functional ||/|||^- Interestingly, the Representer Theorem asserts that the unique solution of 
([T]) is finitely parametrized and has the form /(x) = ^^^^ /3i A'(x, Xj), where can be obtained 

from T; see e.g., |[29l . |[38l . Further details on RKHS, and in particular on the evaluation of 
can be found in e.g., |38i, Ch. 1]. A fundamental relationship between model complexity control and 
generalization capability, i.e., the predictive ability of / beyond the training set, was formalized in ||37| . 

The generalization error performance of approaches that minimize the sum of squared model residuals 
[that is V{u) = V? in ([T])] regularized by a term of the form ||/|||^, is degraded in the presence of outliers. 
This is because the least-squares (LS) part of the cost is not robust, and can result in severe overfitting 
of the (contaminanted) training data 1211 . Recent efforts have considered replacing the squared loss with 
a robust counterpart such as Ruber's function, or its variants, but lack a data-driven means of selecting 
the proper threshold that determines which datum is considered an outlier P3l ; see also ll27l . Other 
approaches have instead relied on the so-termed e-insensitive loss function, originally proposed to solve 
function approximation problems using support vector machines (SVMs) |[37l . These family of estimators 
often referred to as support vector regression (SVR), have been shown to enjoy robustness properties; see 
e.g., |[26l . |[28l . ||32l and references therein. In ||8], improved performance in the presence of outliers is 
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achieved by refining tlie SVR solution tlirougli a subsequent robust learning phase. 

The starting point here is a variational least-trimmed squares (VLTS) estimator, suitable for robust 
function approximation in T-L (Section It is established that VLTS is closely related to an (NP-hard) 
£o-(pseudo)norm-regularized estimator, adopted to fit a regression model that exphcitly incorporates an 
unknown sparse vector of outliers ifTTl. As in compressive sampling (CS) ll35l . efficient (approximate) 
solvers are obtained in Section |IlI] by replacing the outlier vector's ^Q-norm with its closest convex 
approximant, the £i-norm. This leads naturally to a variational M-type estimator of /, also shown equivalent 
to a least-absolute shrinkage and selection operator (Lasso) ||33l on the vector of outliers (Section IIII-AI ). 
A tunable parameter in Lasso controls the sparsity of the estimated vector, and the number of outliers as 
a byproduct. Hence, effective methods to select this parameter are of paramount importance. 

The link between ^i-norm regularization and robustness was also exploited for parameter (but not 
function) estimation in ifTTl and |[22l ; see also BOl for related ideas in the context of face recognition, and 
error correction codes 0]. In ||T7| however, the selection of Lasso's tuning parameter is only justified 
for Gaussian training data; whereas a fixed value motivated by CS error bounds is adopted in the 
Bayesian formulation of ll22l . Here instead, a more general and systematic approach is pursued in Section 
IIII-B[ building on contemporary algorithms that can efficiently compute all robustifaction paths of Lasso 
solutions, i.e., for all values of the tuning parameter lITTI . |[T6l . PTl . In this sense, the method here 
capitalizes on but is not limited to sparse settings, since one can examine all possible sparsity levels 
along the robustification path. An estimator with reduced bias and improved generalization capability is 
obtained in Section |IVl after replacing the lo-nomv with a nonconvex surrogate, instead of the ^i-norm 
that introduces bias ll33l . P4l . Simulated tests demonstrate the effectiveness of the novel approaches 
in robustifying thin-plate smoothing splines |[TOl (Section IV-AK and in estimating the sine function 
(Section IV-BI ) - a paradigm typically adopted to assess performance of robust function approximation 
approaches ll8l. ll43l. 

The motivating application behind the robust nonparametric methods of this paper is load curve cleans- 
ing m - a critical task in power systems engineering and management. Load curve data (also known as 
load profiles) refers to the electric energy consumption periodically recorded by meters at specific points 
across the power grid, e.g., end user-points and substations. Accurate load profiles are critical assets 
aiding operational decisions in the envisioned smart grid system ll20l : see also ID, 121, 161. However, in 
the process of acquiring and transmitting such massive volumes of information to a central processing 
unit, data is often noisy, corrupted, or lost altogether. This could be due to several reasons including meter 
misscalibration or outright failure, as well as communication errors due to noise, network congestion, and 
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connectivity outages; see Fig. [T] for an example. In addition, data significantly deviating from nominal 
load models (outliers) are not uncommon, and could be attributed to unscheduled maintenance leading to 
shutdown of heavy industrial loads, weather constraints, hohdays, strikes, and major sporting events, just 
to name a few. 

In this context, it is critical to effectively reject outliers, and replace the contaminated data with 'healthy' 
load predictions, i.e., to cleanse the load data. While most utilities carry out this task manually based 
on their own personnel's know-how, a first scalable and principled approach to load profile cleansing 
which is based on statistical learning methods was recently proposed in |i6J; which also includes an 
extensive hterature review on the related problem of outlier identification in time-series. After estimating 
the regression function / via either B -spline or Kernel smoothing, pointwise confidence intervals are 
constructed based on /. A datum is deemed as an outlier whenever it falls outside its associated confidence 
interval. To control the degree of smoothing effected by the estimator, O requires the user to label the 
outliers present in a training subset of data, and in this sense the approach therein is not fully automatic. 
Here instead, a novel alternative to load curve cleansing is developed after specializing the robust estimators 
of Sections Hill and [TVl to the case of cubic smoothing splines (Section fV-CI) . The smoothness-and outlier 
sparsity-controUing parameters are selected according to the guidelines in Section IIII-Bl hence, no input 
is required from the data analyst. The proposed spline-based method is tested on real load curve data from 
a government building. 

Concluding remarks are given in Section |Vl] while some technical details are deferred to the Appendix. 
Notation: Bold uppercase letters will denote matrices, whereas bold lowercase letters will stand for column 
vectors. Operators (•)', tr(-) and E[-] will denote transposition, matrix trace and expectation, respectively; 
I • I will be used for the cardinality of a set and the magnitude of a scalar. The norm of vector x G 
is ||x||q := (XliLi for > 1; and ||M||i? := y^tr^MM^ is the matrix Frobenious norm. Positive 

definite matrices will be denoted by M ^ 0. The p x p identity matrix will be represented by Ip, while 
Op will denote the p x 1 vector of all zeros, and Opxg := OpO^- 



The training data comprises N noisy samples of / taken at the input points {xj}^^ (also known as 
knots in the splines parlance), and in the present context they can be possibly contaminated with outliers. 
Building on the parametric least-trimmed squares (LTS) approach [311, the desired robust estimate / can 
be obtained as the solution of the following variational (V)LTS minimization problem 



II. Robust Estimation Problem 




(2) 
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where r'^^if) is the i-th order statistic among the squared residuals r1(f), . . . ,r^(/), and rj(/) := in — 
/(xj). In words, given a feasible / G to evaluate the sum of the cost in ^ one: i) computes all 
squared residuals {rf{f)}fLi, ii) orders them to form the nondecreasing sequence r'^i]{f) < . . . < ?'pjv](/)' 
and iii) sums up the smallest s terms. As in the parametric LTS lOTI . the so-termed trimming constant s 
(also known as coverage) determines the breakdown point of the VLTS estimator, since the largest N — s 
residuals do not participate in Ideally, one would like to make N — s equal to the (typically unknown) 
number of outliers No in the training data. For most pragmatic scenaria where No is unknown, the LTS 
estimator is an attractive option due to its high breakdown point and desirable theoretical properties, 
namely \/]V-consistency and asymptotic normality IIBTI . 

The tuning parameter fi> in ^ controls the tradeoff between fidelity to the (trimmed) data, and the 
degree of "smoothness" measured by ||/|||^- In particular, ||/|||^ can be interpreted as a generalized ridge 
regularization term penalizing more those functions with large coefficients in a basis expansion involving 
the eigenfunctions of the kernel K. 

Given that the sum in Q is a nonconvex functional, a nontrivial issue pertains to the existence of the 
proposed VLTS estimator, i.e., whether or not (|2) attains a minimum in T-L. Fortunately, a (conceptually) 
simple solution procedure suffices to show that a minimizer does indeed exist. Consider specifically a 
given subsample of s training data points, say {yi,'^i}f-i, and solve 

s 

.1=1 

A unique minimizer of the form /(■'^(x) = J2i=i Pi''^ ^{'^^ ^i) is guaranteed to exist, where j is used here 
to denote the chosen subsample, and the coefficients {/Sp"*}!^^ can be obtained by solving a particular 
linear system of equations 1381 p. 11]. This procedure can be repeated for each subsample (there are 
J := (^) of these), to obtain a collection {/^■'^x)}^^^ of candidate solutions of Q. The winner(s) 
/ := /^■'*^ yielding the minimum cost, is the desired VLTS estimator. A remark is now in order. 
Remark 1 (VLTS complexity): Even though conceptually simple, the solution procedure just described 
guarantees existence of (at least) one solution, but entails a combinatorial search over all J subsamples 
which is intractable for moderate to large sample sizes N. In the context of linear regression, algorithms 
to obtain approximate LTS solutions are available; see e.g., ll30l . 

A. Robust function approximation via io-norm regularization 

Instead of discarding large residuals, the alternative approach proposed here explicitly accounts for 
outliers in the regression model. To this end, consider the scalar variables {oj}^^ one per training datum. 



min 
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taking the value Oj = whenever datum i adheres to the postulated nominal model, and Oj 7^ otherwise. 
A regression model naturally accounting for the presence of outliers is 

yi = f{^i) + Oi + ei, i = l,...,N (3) 

where {ei}^^ are zero-mean independent and identically distributed (i.i.d.) random variables modeling 
the observation eixors. A similar model was advocated under different assumptions in (TT] and ll22l . in 
the context of robust parametric regression; see also |@] and BOl . For an outlier-free datum i, ^ reduces 
to Ui = /(xj) + Ej; hence, Si will be often referred to as the nominal noise. Note that in both 
/ G H as well as the x 1 vector o := [oi, . . . ,on]' are unknown; thus, (|3]l is underdetermined. On 
the other hand, as outliers are expected to often comprise a small fraction of the training sample say, not 
exceeding 20% - vector o is typically sparse, i.e., most of its entries are zero; see also Remark |3] Sparsity 
compensates for underdeterminacy and provides valuable side-information when it comes to efficiently 
estimating o, identifying outliers as a byproduct, and consequently performing robust estimation of the 
unknown function /. 

A natural criterion for controlling outlier sparsity is to seek the desired estimate / as the solution of 

N 

Y^{y,-f{^,)-0,f + Ml^, S.t. ||o||o<T (4) 
.i=l 

where r is a preselected threshold, and ||o||o denotes the ^Q-norm of o, which equals the number of 
nonzero entries of its vector argument. Sparsity is directly controlled by the selection of the tuning 
parameter r > 0. If the number of outliers No were known a priori, then r should be selected equal 
to Nq. Unfortunately, analogously to related io-nonn constrained formulations in compressive sampling 
and sparse signal representations, problem © is NP-hard. In addition, (01) can be recast to an equivalent 
(unconstrained) Lagrangian form; see e.g., |l3] 



mm 
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(5) 



" N 

J] (y, - /(xi) - oO' + /ill/Ill, + Ao||o||o 
J=i 

where the tuning Lagrange multiplier Aq > plays a role similar to r in (IDl, and the ^Q-norm sparsity 
encouraging penalty is added to the cost. 

To further motivate model (|3]l and the proposed criterion (|5] for robust nonparametric regression, it is 
worth checking the structure of the minimizers {/, 6} of the cost in ([5]). Consider for the sake of argument 
that Aq is given, and its value is such that ||6||o = for some < u < N. The goal is to characterize 
/, as well as the positions and values of the nonzero entries of 6. Note that because ||6||o = u, the last 
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Vi - fi^i), it is 



Oi 



(6) 



temi in (|5]l is constant, hence inconsequential to the minimization. Upon defining fj 
not hard to see that the entries of 6 satisfy 

0, |fi| < ^/>^ , 

, 1 = 1,. ..,N 

h, \ri\ > 

at the optimum. This is intuitive, since for those 6j / the best thing to do in terms of minimizing the 
overall cost is to set Oj = fj, and thus null the corresponding squared-residual terms in In conclusion, 
for the chosen value of Aq it holds that v squared residuals effectively do not contribute to the cost in 

To determine the support of 6 and /, one alternative is to exhaustively test all admissible support 
combinations. For each one of these combinations (indexed by j), let Sj C {!,..., N} be the index set 



describing the support of o^-'), i.e., d- / if and only if i E Sj; and \S. 
corresponding candidate f^^'^ minimizes 



By virtue of the 
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while / is the one among all {f^^^} that yields the least cost. The previous discussion, in conjunction 
with the one preceding Remark [T] completes the argument required to establish the following result. 

Proposition 1: If {/,6} minimizes ([5]) with Xq chosen such that ||6||o = N — s, then f also solves the 
VLTS problem ©. 

The importance of Proposition [U is threefold. First, it formally justifies model ^ and its estimator ^ 
for robust function approximation, in light of the well documented merits of LTS regression [30] . Second, 
it further solidifies the connection between sparse linear regression and robust estimation. Third, the Iq- 
norm regularized formulation in (|5) lends itself naturally to efficient solvers based on convex relaxation, 
the subject dealt with next. 



III. Sparsity Controlling Outlier Rejection 

To overcome the complexity hurdle in solving the robust regression problem in one can resort to 
a suitable relaxation of the objective function. The goal is to formulate an optimization problem which 
is tractable, and whose solution yields a satisfactory approximation to the minimizer of the original hard 
problem. To this end, it is useful to recall that the £i-norm ||x||i of vector x is the closest convex 
approximation of ||x||o. This property also utilized in the context of compressive sampling |[35l . provides 
the motivation to relax the NP-hard problem ([5) to 

|2 
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(7) 
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Being a convex optimization problem, (|7]) can be solved efficiently. The nondifferentiable £i-nomi reg- 
ularization term controls sparsity on the estimator of o, a property that has been recently exploited in 
diverse problems in engineering, statistics and machine learning. A noteworthy representative is the least- 
absolute shrinkage and selection operator (Lasso) ll33l . a popular tool in statistics for joint estimation and 
continuous variable selection in linear regression problems. In its Lagrangian form. Lasso is also known 
as basis pursuit denoising in the signal processing literature, a term coined by ||7] in the context of finding 
the best sparse signal expansion using an overcomplete basis. 

It is pertinent to ponder on whether problem (|7) has built-in ability to provide robust estimates / in 
the presence of outliers. The answer is in the affirmative, since a straightforward argument (details are 

deferred to the Appendix) shows that ^ is equivalent to a variational M-type estimator found by 

r N 1 



Remark 2 (Regularized regression and robustness): Existing works on linear regression have pointed 
out the equivalence between ^i-norm regularized regression and M-type estimators, under specific assump- 
tions on the distribution of the outliers (e-contamination) ifTTl . ll23l . However, they have not recognized 
the link with LTS through the convex relaxation of and the connection asserted by Proposition [1] 
Here, the treatment goes beyond linear regression by considering nonparametric functional approximation 
in RKHS. Linear regression is subsumed as a special case, when the linear kernel K{-K,-y) := x'y is 
adopted. In addition, no assumption is imposed on the outlier vector. 

It is interesting to compare the Iq- and ^i-norm formulations [cf. (jS) and respectively] in terms 
of their equivalent purely variational counterparts in (2) and dSjl, that entail robust loss functions. While 
the VLTS estimator completely discards large residuals, p still retains them, but downweighs their effect 
through a linear penalty. Moreover, while dH) is convex, (12) is not and this has a direct impact on the 
complexity to obtain either estimator. Regarding the trimming constant s in Q, it controls the number 
of residuals retained and hence the breakdown point of VLTS. Considering instead the threshold Ai/2 in 
Huber's function p, when the outliers' distribution is known a-priori, its value is available in closed form 
so that the robust estimator is optimal in a well-defined sense [|211 . Convergence in probability of M-type 
cubic smoothing splines estimators - a special problem subsumed by ([8]l - was studied in Q- 




(8) 



where p : M — )■ M is a scaled version of Huber's convex loss function ll2n 




u\ < Ai/2 
u\ > Ai/2 



(9) 
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A. Solving the convex relaxation 

Because © is jointly convex in / and o, an alternating minimization (AM) algorithm can be adopted 
to solve Q, for fixed values of /i and Ai. Selection of these parameters is a critical issue that will be 
discussed in Section ITlI-BI AM solvers are iterative procedures that fix one of the variables to its most up 
to date value, and minimize the resulting cost with respect to the other one. Then the roles are reversed to 
complete one cycle, and the overall two-step minimization procedure is repeated for a prescribed number 
of iterations, or, until a convergence criterion is met. Letting A; = 0, 1, . . . denote iterations, consider that 
o := o^^'"^) is fixed in ([T]). The update for f^^'^ at the A;-th iteration is given by 



' := aremin 



N 2 



(10) 



which corresponds to a standard regularization problem for functional approximation in Ti 11211 . but with 
outlier-compensated data yyi — o\ ' j . • ^^11 known that the minimizer of the variational prob- 
lem ([Tol l is finitely parameterized, and given by the kernel expansion /(^^x) = Eili /^f ^^(x,Xi) fM . 
The vector /3 := . . . , fi^]' is found by solving the linear system of equations 

[K + ^I^]/3W=y-o('=-i) (11) 

where y := [yi, . . . , un]' , and the N x N matrix K ^ has entries := ir(xj, Xj). 

In a nutshell, updating /^^^ is equivalent to updating vector (3^^'^ as per ([TTI l. where only the independent 
vector variable y — o^^"^) changes across iterations. Because the system matrix is positive definite, the 
per iteration systems of linear equations (fTTT l can be efficiently solved after computing once, the Cholesky 
factorization of K + /^Iat. 

For fixed / := f^'^^ in dTJl, the outlier vector update o^'^) at iteration k is obtained as 



o'-'^^ := arg min 

oGM™ 



■ N „ 



(12) 

oeK" ' — ■ \ " / 

.i=l 



.i=l 

where r^'^' := yj — X]j-!=i f^j"' ^i'^i^^j)- Problem ([T2l ) can be recognized as an instance of Lasso for the 
so-termed orthonormal case, in particular for an identity regression matrix. The solution of such Lasso 
problems is readily obtained via soft-thresholding lITSl . in the form of 



(k) 

o) := o 



(rf,Ai/2), i = l,...,N (13) 



where S{z,^) := sign(2;)(|z| — 7)+ is the soft-thresholding operator, and (•)_!_ := max(0,-) denotes the 
projection onto the nonnegative reals. The coordinatewise updates in ([T3] ) are in par with the sparsifying 
property of the ii norm, since for "small" residuals, i.e., r^'^^ < Ai/2, it follows that oj'^'* = 0, and the 
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Algorithm 1 : AM solver 

Initialize o^^^^ = 0, and run till convergence 

for fc = 0, 1,. . . do 

Update /SC') solving [K + /ilw] /jC') = y - oC'-i). 

Update oC-^) via of ^ = S (y, - Ef=i /?f ^ A'(x„ x,), A1/2) , i = l,...,N. 
end for 

return /(x) = Eti /^^^^^(x, x,) 



i-th training datum is deemed outlier free. Updates (fTTT i and ([T3] ) comprise the iterative AM solver of the 
^i-norm regularized problem (|7}, which is tabulated as Algorithm [T] Convexity ensures convergence to 
the global optimum solution regardless of the initial condition; see e.g., [Bl- 

Algorithm [T] is also conceptually interesting, since it explicitly reveals the intertwining between the 
outlier identification process, and the estimation of the regression function with the appropriate outlier- 
compensated data. An additional point is worth mentioning after inspection of ( fT3] ) in the limit as fc — )• 00. 

(k) 

From the definition of the soft-thresholding operator S, for those "large" residuals fi := limfc_!,oo ^"1 
exceeding Ai/2 in magnitude, oi = fi — Ai/2 when fj > 0, and 6j = f j + Ai/2 otherwise. In other words, 
larger residuals that the method identifies as corresponding to outlier-contaminated data are shrunk, but 
not completely discarded. By plugging 6 back into (O, these "large" residuals cancel out in the squared 
error term, but still contribute linearly through the ^i-norm regularizer. This is exactly what one would 
expect, in light of the equivalence established with the variational M-type estimator in ([8]l. 

Next, it is established that an alternative to solving a sequence of linear systems and scalar Lasso 
problems, is to solve a single instance of the Lasso with specific response vector and (non-orthonormal) 
regression matrix. 

Proposition 2: Consider Oiasso defined as 

OLasso ■■= arg mill ||X^y - X^o||2 + Ai||o||i (14) 

oeK" 



where 

' Itv - K (K + filN) 



(15) 



Then the minimizers {/, 6} of ([7]) are fully determined given Oiasso, <^is 6 := Oiasso <^nd /(x) = Xli^i Xj), 

with /3 = (K + iAnT^ (y - OLasso)- 

Proof: For notational convenience introduce the x 1 vectors f := [/(xi), . . . , /(xyv)]' and f := 
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[/(xi), . . . , /(xtv)]', where f is, the minimizer of (|7]). Next, consider rewriting © as 



mm 



mm||(y -oj 



f|l2 



+ Aillol 



1- 



(16) 



The quantity inside the square brackets is a function of o, and can be written explicitly after carrying out 
the minimization with respect to / G From the results in ||38l . it follows that the vector of optimum 
predicted values at the points {xi}fLi is given by f = K/3 = K (K + j^tliy)"^ (y — o); see also the discus- 
sion after Similarly, one finds that = $'K0 = (y-o)' (K + /ilTv)"^ K (K + ^Itv)"^ (y-o). 
Having minimized (fT6l) with respect to /, the quantity inside the square brackets is (F^ := (K + ^Ijy)"^) 

2 



mm 



f|l2 + A* 









(y-o)-f 



= ||(y - o) - Kr^(y - o)\\l + /.(y - o)'r^Kr^(y - o) 

= 11(1^ - Kr^)y - {In - Kr/,)o||^ + /i(y - o)'r,,Kr,,(y - o). (17) 

After expanding the quadratic form in the right-hand side of (fTTl ). and eliminating the term that does not 
depend on o, problem ([T6] l becomes 

min [ll {In - Kr^)y - (Itv - Kr^)o||2 - 2//y'r^Kr^o + ^oT^KF^o + Ai ||o||i 



Completing the square one arrives at 

In - Kr, 



mm 

oGR" 



(^K)V2r, 



In - Kr^ 

(^K)V2r, 



o 



+ -^il|o||i 



which completes the proof. ■ 
The result in Proposition |2] opens the possibility for effective methods to select Ai. These methods to 
be described in detail in the ensuing section, capitalize on recent algorithmic advances on Lasso solvers, 
which allow one to efficiently compute OLasso for all values of the tuning parameter Ai. This is crucial for 
obtaining satisfactory robust estimates /, since controlling the sparsity in o by tuning Ai is tantamount 
to controlling the number of outliers in model 



B. Selection of the tuning parameters: robustification paths 

As argued before, the tuning parameters ^ and Ai in ([T]) control the degree of smoothness in / and the 
number of outliers (nonzero entries in OLasso). respectively. From a statistical learning theory standpoint, /i 
and Ai control the amount of regularization and model complexity, thus capturing the so-termed effective 
degrees of freedom |[T9l . Complex models tend to have worse generalization capability, even though 
the prediction error over the training set T may be small (overfitting). In the contexts of regularization 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (SUBMITTED) 



II 



networks lfT2l and Lasso estimation for regression ll33l . corresponding tuning parameters are typically 
selected via model selection techniques such as cross-validation, or, by minimizing the prediction error 
over an independent test set, if available ||T9l . However, these simple methods are severely challenged in 
the presence of multiple outliers. For example, the swamping effect refers to a very large value of the 
residual rj corresponding to a left out clean datum {yj, Xj}, because of an unsatisfactory model estimation 
based on all data except i; data which contain outliers. 

The idea here offers an alternative method to overcome the aforementioned challenges, and the possibility 
to efficiently compute Olusso for all values of Ai, given fi. A brief overview of the state-of-the-art in Lasso 
solvers is given first. Several methods for selecting and Ai are then described, which differ on the 
assumptions of what is known regarding the outlier model (|3). 

Lasso amounts to solving a quadratic programming (QP) problem ||33l ; hence, an iterative procedure is 
required to determine OLasso in (fT4l) for a given value of Ai. While standard QP solvers can be certainly 
invoked to this end, an increasing amount of effort has been put recently toward developing fast algorithms 
that capitalize on the unique properties of Lasso. The LARS algorithm lITTI is an efficient scheme for 
computing the entire path of solutions (corresponding to all values of Ai), sometimes referred to as 
regularization paths. LARS capitalizes on piecewise linearity of the Lasso path of solutions, while incurring 
the complexity of a single LS fit, i.e., when Ai = 0. Coordinate descent algorithms have been shown 
competitive, even outperforming LARS when p is large, as demonstrated in |[T6l : see also lITSl . B2| . and 
the references therein. Coordinate descent solvers capitalize on the fact that Lasso can afford a very simple 
solution in the scalar case, which is given in closed form in terms of a soft-thresholding operation [cf. 
(fT3l)1. Further computational savings are attained through the use of warm starts flSl . when computing 
the Lasso path of solutions over a grid of decreasing values of Ai. An efficient solver capitalizing on 
variable separability has been proposed in 11411 . 

Consider then a grid of values of in the interval [^min, ^max]. evenly spaced in a logarithmic 
scale. Likewise, for each /i consider a similar type of grid consisting of Gx values of Ai, where Amax := 
2minj jy'X^x^ j| is the minimum Ai value such that OLasso 7^ Oat fT6l . and := [x^ i . . . x^ tv] in (fT4l ). 
Typically, Amin = eAmax with e = 10"^, say. Note that each of the values of p gives rise to a different 
A grid, since Amax depends on through X^. Given the previously surveyed algorithmic alternatives to 
tackle the Lasso, it is safe to assume that (fT4l ) can be efficiently solved over the (nonuniform) G^ x Gx 
grid of values of the tuning parameters. This way, for each value of fi one obtains Gx samples of the 
Lasso path of solutions, which in the present context can be referred to as robustification path. As Ai 
decreases, more variables OLasso,^ enter the model signifying that more of the training data are deemed to 
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contain outliers. An example of the robustification path is given in Fig. [3i 

Based on the robustification paths and the prior knowledge available on the outlier model (|3]l, several 
alternatives are given next to select the "best" pair {/i, Ai} in the grid G^^ x G\. 

Number of outliers is known: When Nq is known, by direct inspection of the robustification paths one can 
determine the range of values for Ai, for which OLasso has exactly No nonzero entries. Specializing to the 
interval of interest, and after discarding outliers which are now fixed and known, A'-fold cross-validation 
methods can be applied to determine Ai. 

Variance of the nominal noise is known: Supposing that the variance ai of the i.i.d. nominal noise variables 
£i in ^ is known, one can proceed as follows. Using the solution / obtained for each pair {/ij,Aj} on 
the grid, form the x G\ sample variance matrix II with ij-th entry 

[n^r■= E ^ll^o= iyu-f{^u)f/No (18) 

«|fiL.s».»=0 u|dLHW,»=0 

where No stands for the number of nonzero entries in OLasso- Although not made explicit, the right-hand 
side of (fTSl ) depends on {/ij,Aj} through the estimate /, OLasso and No- The entries [SI]jj correspond to 
a sample estimate of a^, without considering those training data {yj,Xj} that the method determined to 
be contaminated with outliers, i.e., those indices i for which OLasso,* 7^ 0. The "winner" tuning parameters 
{/i*,A^} := {/ij.,Aj*} are such that 

\i\f] :=argmin|[S],, -a^l (19) 
which is an absolute variance deviation (AVD) criterion. 

Variance of the nominal noise is unknown: If is unknown, one can still compute a robust estimate of 
the variance ai, and repeat the previous procedure (with known nominal noise variance) after replacing 
ai with ai in ( fT9l ). One option is based on the median absolute deviation (MAD) estimator, namely 

de ■= 1.4826 X median^ — median^ (fj) |) (20) 

where the residuals ri = t/i — /(xj) are formed based on a nonrobust estimate of /, obtained e.g., after 
solving dV]) with Ai = and using a small subset of the training dataset T. The factor 1.4826 provides an 
approximately unbiased estimate of the standard deviation when the nominal noise is Gaussian. Typically, 
as in (|20l ) is used as an estimate for the scale of the errors in general M-type robust estimators; see 
e.g., H and ||271. 

Remark 3 (How sparse is sparse): Even though the very nature of outliers dictates that No is typically 
a small fraction of - and thus o in (|3]l is sparse - the method here capitalizes on, but is not limited 
to sparse settings. For instance, choosing Ai G [Knin ~ 0, Amax] along the robustification paths allows 
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one to continuously control the sparsity level, and potentially select the right value of Ai for any given 
No G {!,••• ,N}. Admittedly, if No is large relative to A^, then even if it is possible to identify and 
discard the outliers, the estimate / may not be accurate due to the lack of outlier-free data. 



IV. Refinement via Nonconvex Regularization 

Instead of substituting ||o||o in (l5]l by its closest convex approximation, namely ||o||i, letting the surrogate 
function to be non-convex can yield tighter approximations. For example, the ^o-norm of a vector x G R" 
was surrogated in (51 by the logarithm of the geometric mean of its elements, or by XlILi 
rank minimization problems, apart from the nuclear norm relaxation, minimizing the logarithm of the 
determinant of the unknown matrix has been proposed as an alternative surrogate ||T41 . Adopting related 
ideas in the present nonparametric context, consider approximating (|5]l by 



mm 

oGR" 



N 



N 



(21) 



Y^iVi - /(xi) - Oif + fiWfWl + XoY^ log(|oi| + 5) 
.1=1 1=1 

where 5 is a sufficiently small positive offset introduced to avoid numerical instability. 

Since the surrogate term in (1211 is concave, the overall problem is nonconvex. Still, local methods based 
on iterative linearization of log(|oi| + 6), around the current iterate of'\ can be adopted to minimize (|2TI ). 
From the concavity of the logarithm, its local linear approximation serves as a global overestimator. Stan- 
dard majorization-minimization algorithms motivate minimizing the global linear overestimator instead. 
This leads to the following iteration for /c = 0, 1, . . . (see e.g., ||25l for further details) 

" N N 

Y^iVi - fi^i) - o,f + Mil + Ao wf\o. 



rf(fc) o('')l := arg min 

oGM" 



1=1 



1=1 



(22) 



1, 



(23) 



It is possible to eliminate the optimization variable / G H from (|22] |. by direct application of the result 
in Proposition |2l The equivalent update for o at iteration k is then given by 



o^'^^ := arg min 



N 



|X/,y - X/,o||2 + 



(24) 



which amounts to an iteratively reweighted version of (114) . If the value of \ol \ is small, then in the next 

iteration the corresponding regularization term \Qw'f\oi\ has a large weight, thus promoting shrinkage 

(fe— i)i 

of that coordinate to zero. On the other hand when |o) | is significant, the cost in the next iteration 
downweighs the regularization, and places more importance to the LS component of the fit. For small 5, 
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analysis of the limiting point o* of (124] ) reveals that 

Ao, |o*|/0 



\ow*\o*^ 



0, |o*|=0 



and hence, AoXlili^lo*! ~ Ao||o*||o. 

A good initialization for the iteration in (|24] | and (l23T i is OLasso, which corresponds to the solution of 
(fT4b [and (O] for Aq = A J and fi = fi*. This is equivalent to a single iteration of (|24l l with all weights 
equal to unity. The numerical tests in Section |V] will indicate that even a single iteration of (|24] | suffices 
to obtain improved estimates /, in comparison to those obtained from (fT4l i. The following remark sheds 
further light towards understanding why this should be expected. 

Remark 4 (Refinement through bias reduction): Uniformly weighted ^i-norm regularized estimators 
such as ([T]) are biased P4l . due to the shrinkage effected on the estimated coefficients. It will be argued 
next that the improvements due to (l24l ) can be leveraged to bias reduction. Several workarounds have been 
proposed to correct the bias in sparse regression, that could as well be applied here. A first possibility is 
to retain only the support of (IT4] | and re-estimate the amplitudes via, e.g., the unbiased LS estimator liTll . 
An alternative approach to reducing bias is through nonconvex regularization using e.g., the smoothly 
clipped absolute deviation (SCAD) scheme |[T3l . The SCAD penalty could replace the sum of logarithms 
in (I2TI ). still leading to a nonconvex problem. To retain the efficiency of convex optimization solvers while 
simultaneously limiting the bias, suitably weighted ^i-norm regularizers have been proposed instead [H4l . 
The constant weights in P4l play a role similar to those in (|23] ); hence, bias reduction is expected. 

V. Numerical Experiments 

A. Robust thin-plate smoothing splines 

To validate the proposed approach to robust nonparametric regression, a simulated test is carried out 
here in the context of thin-plate smoothing spline approximation lITOl . ll39l . Specializing Q to this setup, 
the robust thin-plate splines estimator can be formulated as 



mm 

oGK" 



N 

V(y^ - /(xi) - oif + ^, / IIvVIIf^x + Ai||o|| 



(25) 



where ||V^/||f denotes the Frobenius norm of the Hessian of / : — R. The penalty functional 

2 / o9 .\ 21 



dx (26) 



extends to the one-dimensional roughness regularization used in smoothing spline models. For fi = 0, 
the (non-unique) estimate in (|25] | corresponds to a rough function interpolating the outlier compensated 
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data; while as /i — )• oo the estimate is linear (of. V^/(x) = 02x2)- The optimization is over S, the space 
of Sobolev functions, for which J[f] is well defined lITOl p. 85]. Reproducing kernel Hilbert spaces such 
as S, with inner-products (and norms) involving derivatives are studied in detail in |[38l . 

Different from the cases considered so far, the smoothing penalty in (l26l ) is only a seminorm, since 
first-order polynomials vanish under J[-]. Omitting details than can be found in |[38l p. 30], under fairly 
general conditions a unique minimizer of (l25l l exists. The solution admits the finitely parametrized form 
/(^) = Sil=i PiK {'K, :x.i) + a']^x + ao, where in this case il'(x, y) := ||x — yplog ||x — y|| is a radial 
basis function. In simple terms, the solution as a kernel expansion is augmented with a member of the 
null space of J[-]. The unknown parameters {P,ai,ao} are obtained in closed form, as solutions to a 
constrained, regularized LS problem; see ll38l p. 33]. As a result. Proposition |2] still holds with minor 
modifications on the structure of X^. 

Remark 5 (Bayesian framework): Adopting a Bayesian perspective, one could model /(x) in ^ as 
a sample function of a zero mean Gaussian stationary process, with covariance function K(x, y) = 
||x — y|plog||x — y|| ||24l . Consider as well that {/(x), {oj, Ej}^^} are mutually independent, while 
£i ~ A/'(0,/x*/2) and Oj ~ £(0,/x*/AJ) in ^ are i.i.d. Gaussian and Laplace distributed, respectively. 
From the results in |]24| and a straightforward calculation, it follows that setting Ai = AJ and = fi* in 
(l25l) yields estimates / (and 6) which are optimal in a maximum a posteriori sense. This provides yet 
another means of selecting the parameters and Ai, further expanding the options presented in Section 

The simulation setup is as follows. Noisy samples of the true function /o : — R comprise the 
training set T. Function fo is generated as a Gaussian mixture with two components, with respective 
mean vectors and covariance matrices given by 





0.2295 




2.2431 


0.4577 




2.4566 




2.9069 


0.5236 


fj.1 = 






















0.4996 




0.4577 


1.0037 




2.9461 




0.5236 


1.7299 



Function /o(x) is depicted in Fig. |4](a). The training data set comprises = 200 examples, with inputs 
{xj}^]^ drawn from a uniform distribution in the square [0, 3] x [0, 3]. Several values ranging from 5% to 
25% of the data are generated contaminated with outliers. Without loss of generality, the corrupted data 
correspond to the first No training samples with No = {10, 20, 30, 40, 50}, for which the response values 
{yi}i^=i independently drawn from a uniform distribution over [—4,4]. Outlier-free data are generated 
according to the model yi = /o(xj) + Cj, where the independent additive noise terms ~ AA(0, 10~^) 
are Gaussian distributed, for i = No + I, ... , 200. For the case where A'^r, = 20, the data used in the 
experiment is shown in Fig. |2] Superimposed to the true function fo are 180 black points corresponding 
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TABLE I 

Results for the thin-plate splines simulated test 



No 






err for {T} 


err for i2]\ 


Errr for 


Errr for ^ 


10 


3.87 X 10"^ 


2.90 X 10"^ 


1.00 X 10"" 


1.03 X 10"" 


2.37 X 10"^ 


2.27 X 10"^ 


20 


3.83 X 10"^ 


L55 X 10~^ 


1.00 X 10"" 


9.16 X 10"^ 


4.27 X 10"^ 


2.39 X 10"^ 


30 


2.28 X 10"^ 


6.67 X 10"^ 


1.22 X 10"" 


1.18 X 10"" 


2.89 X 10"^ 


1.93 X 10"* 


40 


2.79 X 10"^ 


6.10 X 10"^ 


1.01 X 10"" 


1.14 X 10"" 


1.57 X 10"^ 


1.32 X 10"* 


50 


2.49 X 10"^ 


5.42 X 10"^ 


1.01 X 10"" 


9.9 X 10"^ 


1.19 X 10"^ 


1.05 X 10"* 



to data drawn from the nominal model, as well as 20 red outlier points. 

For this experiment, the nominal noise variance = 10"'^ is assumed known. A nonuniform grid 
of fi and Ai values is constructed, as described in Section IIII-BI The relevant parameters are = 
G\ = 200, /timin = 10"^ and /imax = 1- FoT each value of /i, the Ai grid spans the interval defined by 
Amax := 2 minj |y'X^x^^j| and Amin = eAmax> where e = lO""^. Each of the robustification paths 
corresponding to the solution of (fT4] l is obtained using the SpaRSA toolbox in IHl, exploiting warm 
starts for faster convergence. Fig. |3] depicts an example with No = 20 and ^* = 1.55 x 10"^. With the 
robustification paths at hand, it is possible to form the sample variance matrix SI [cf. (fTSlll. and select the 
optimum tuning parameters {^* , \\] based on the criterion ([T9] l. Finally, the robust estimates are refined 
by running a single iteration of (|24] | as described in Section |lVl The value 5 = 10"^ was utilized, and 
several experiments indicated that the results are quite insensitive to the selection of this parameter. 

The same experiment was conducted for a variable number of outliers No, and the results are listed 
in Table Jl In all cases, a 100% outlier identification success rate was obtained, for the chosen value of 
the tuning parameters. This even happened at the first stage of the method, i.e., OLasso in (fT4l) had the 
correct support in all cases. It has been observed in some other setups that (fT4l ) may select a larger support 
than [1, A'o], but after running a few iterations of (l24l the true support was typically identified. To assess 
quality of the estimated function /, two figures of merit were considered. First, the training error err was 
evaluated as 

1 ^ 2 

i.e., the average loss over the training sample T after excluding outliers. Second, to assess the generalization 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (SUBMITTED) 



17 



capability of /, an approximation to the generalization error Err 7- was computed as 



Err 7- = E 



2/-/(x) \r 



1 ^ 

^ J](^.-/(xi) 

i=l 



(27) 



where {yi,'x.i}^i is an independent test set generated from the model iji = /a(xi) + Ej. For the results 
in Table Jl iV = 961 was adopted corresponding to a uniform rectangular grid of 31 x 31 points in 
[0,3] X [0,3]. Inspection of Table U reveals that the training errors err are comparable for the function 
estimates obtained after solving ^ or its nonconvex refinement (|2TI ). Interestingly, when it comes to the 
more pragmatic generalization error E1T7-, the refined estimator (1211 has an edge for all values of No- 
As expected, the bias reduction effected by the iteratively reweighting procedure of Section |IV] improves 
considerably the generalization capability of the method; see also Remark |4l 

A pictorial summary of the results is given in Fig lU for No = 20 outliers. Fig |4] (a) depicts the true 
Gaussian mixture /o(x), whereas Fig. |4](b) shows the nonrobust thin-plate splines estimate obtained after 
solving 



mm 
/e«s 



N 

i=i -^^^ 



(28) 



Even though the thin-plate penalty enforces some degree of smoothness, the estimate is severely disrupted 
by the presence of outliers [cf. the difference on the z-axis ranges]. On the other hand. Figs. S](c) and 
(d), respectively, show the robust estimate / with = 3.83 x 10^^, and its bias reducing refinement. 
The improvement is apparent, corroborating the effectiveness of the proposed approach. 



B. Sine function estimation 

The univariate function sinc(3;) := sin(7r2;)/(7rx) is commonly adopted to evaluate the performance 
of nonparametric regression methods |j8], P3l . Given noisy training examples with a small fraction of 
outliers, approximating sinc(x) over the interval [—5,5] is considered in the present simulated test. The 
sparsity-controUing robust nonparametric regression methods of this paper are compared with the SVR [37] 
and robust SVR in fSl, for the case of the e-insensitve loss function with values e = 0.1 and e = 0.01. In 
order to implement (R)SVR, routines from a publicly available SVM Matlab toolbox were utilized ifTSl . 
Results for the nonrobust regularization network approach in ^ (with V{u) = v?) are reported as well, 
to assess the performance degradation incurred when compared to the aforementioned robust alternatives. 
Because the fraction of outliers (Nq/N) in the training data is assumed known to the method of [8], the 
same will be assumed towards selecting the tuning parameters Ai and /i in Q, as described in Section 
IIII-BI The Ai}-grid parameters selected for the experiment in Section IV-AI were used here as well. 
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TABLE II 

Generalization ERROR (Errt) results for the sinc function estimation experiment 



Method 


cr^ = 1 X 10"* 


(t| = 1 X 10"^ 


erf = 1 X 10"^ 


Nnnrfthii^t rifTli with — 7/^1 




S 9S X 1 0"^ 


1.13 X 10"^ 


SVR with F — n 1 


fi 00 X 10~^ 


fi 42 X 10"* 


6 15 X 10~^ 


RSVR with e = 0.1 


1.10 X 10"=* 


5.10 X 10"* 


4.47 X 10"^ 


SVR with e = 0.01 


8.24 X 10"^ 


4.79 X 10"* 


5.60 X 10"-' 


RSVR with e = 0.01 


7.75 X lO"'^ 


3.90 X 10"* 


3.32 X 10"^ 


Sparsity-controUing in i|7j 


1.47 X 10"'' 


6.56 X 10"* 


4.60 X 10"-' 


Refinement in i2ll 


7.46 X 10"^ 


3.59 X 10"* 


3.21 X 10"-^ 



except for ^Umin = 10^^. Space Ti is chosen to be the RKHS induced by the positive definite Gaussian 
kernel function K{u,v) = exp [— (n — 'y)^/(2r/^)] , with parameter r] = 0.1 for all cases. 

The training set comprises = 50 examples, with scalar inputs {xi}fLi drawn from a uniform 
distribution over [—5,5]. Uniformly distributed outliers {yi}^^i ~ ZY[— 5,5] are artificially added in T, 
with No = 3 resulting in 6% contamination. Nominal data in T adheres to the model yi = sinc(xj) + Si 
for i = No + 1, N, where the independent additive noise terms ei are zero-mean Gaussian distributed. 
Three different values are considered for the nominal noise variance, namely cj^ = 1 x 10~' for / = 2, 3, 4. 
For the case where a"^ = 1 x 10~^, the data used in the experiment are shown in Fig. |5](a). Superimposed 
to the true function sinc(x) (shown in blue) are 47 black points corresponding to the noisy data obeying 
the nominal model, as well as 3 outliers depicted as red points. 

The results are summarized in Table |II1 which lists the generalization errors Err 7- attained by the 
different methods tested, and for varying a^. The independent test set {yi,Xi}^^ used to evaluate ( |27l ) 
was generated from the model fji = sinc(xj) + Sj, where the Xj define a. N = 101-element uniform 
grid over [—5,5]. A first (expected) observation is that all robust alternatives markedly outperform the 
nonrobust regularization network approach in ([1}, by an order of magnitude or even more, regardless of 
the value of cr^. As reported in |l8], RSVR uniformly outperforms SVR. For the case e = 0.01, RSVR 
also uniformly outperforms the sparsity-controUing method in (|7]). Interestingly, after refining the estimate 
obtained via (|7]) through a couple iterations of (|24] | (cf. Section ITVl) . the lowest generalization errors are 
obtained, uniformly across all simulated values of the nominal noise variance. Results for the RSVR with 
e = 0.01 come sufficiently close, and are equally satisfactory for all practical purposes; see also Fig. |5] 
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for a pictorial summary of the results when cr^ = 1 x 10~^. 

While specific error values or method rankings are arguably anecdotal, two conclusions stand out: (i) 
model (O and its sparsity-controUing estimators (|7]l and (|2TI) are effective approaches to nonparametric 
regression in the presence of outliers; and (ii) when initialized with OLasso the refined estimator (1211 can 
considerably improve the performance of (0, at the price of a modest increase in computational complexity. 
While (O endowed with the sparsity-controUing mechanisms of Section IIII-BI tends to overestimate the 
"true" support of o, numerical results have consistently shown that the refinement in Section |IV] is more 
effective when it comes to support recovery. 

C. Load curve data cleansing 

In this section, the robust nonparametric methods described so far are applied to the problem of load 
curve cleansing outlined in Section H Given load data T := corresponding to a building's 

power consumption measurements yi, acquired at time instants tj, i = I, . . . ,N, the proposed approach 
to load curve cleansing minimizes 



cleansed estimate of the load profile, and the support of 6 indicates the instants where significant load 
deviations, or, meter failures occurred. Estimator (^9) specializes ^ to the so-termed cubic smoothing 
splines; see e.g., |[T9ll . |[38l . It is also subsumed as a special case of the robust thin-plate splines estimator 
( |25l ). when the target function / has domain in M [cf. how the smoothing penalty ( [261 ) simplifies to the 
one in ( [29l ) in the one-dimensional case]. 

In light of the aforementioned connection, it should not be surprising that / admits a unique, finite- 
dimensional minimizer, which corresponds to a natural spline with knots at {tj}^^; see e.g., |[T9l p. 
151]. Specifically, it follows that f{t) = X]il=i where {hi{t)}f^^ is the basis set of natural spUne 

functions, and the vector of expansion coefficients 6 := [^i, . . . , O^]' is given by 



where matrix B G M^^^ has ij-th entry [B]ij = hj{ti); while * G M^><^ has ij-th entry = 
/ b'-{t)bj{t)dt. Spline coefficients can be computed more efficiently if the basis of B-splines is adopted 
instead; details can be found in fT9l p. 189] and 136]. 

Without considering the outlier variables in (|29] l. a B-spline estimator for load curve cleansing was put 
forth in [6|. An alternative Nadaraya- Watson estimator from the Kernel smoothing family was considered 




(29) 



where f"{t) denotes 




solution / provides a 



0= (B'B + /x*) ^B'(y-6) 
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as well. In any case, outliers are identified during a post-processing stage, after the load curve has been 
estimated nonrobustly. Supposing for instance that the approach in O correctly identifies outliers most of 



which accounts for the outlier compensated data to yield a cleansed estimate at once. Moreover, to select 
the "optimum" smoothing parameter fi, the approach of O requires the user to manually label the outliers 
present in a training subset of data, during a pre-processing stage. This subjective component makes it 
challenging to reproduce the results of [JSl, and for this reason comparisons with the aforementioned 
scheme are not included in the sequel. 

Next, estimator ( |29l ) is tested on real load curve data provided by the North Write Energy Group. The 
dataset consists of power consumption measurements (in kWh) for a government building, collected every 
fifteen minutes during a period of more than five years, ranging from July 2005 to October 2010. Data is 
downsampled by a factor of four, to yield one measurement per hour. For the present experiment, only a 
subset of the whole data is utilized for concreteness, where = 501 was chosen corresponding to a 501 
hour period. A snapshot of this training load curve data in T, spanning a particular three-week period 
is shown in Fig. [6] (a). Weekday activity patterns can be clearly discerned from those corresponding to 
weekends, as expected for most government buildings; but different, e.g., for the load profile of a grocery 
store. Fig. |6] (b) shows the nonrobust smoothing spline fit to the training data in T (also shown for 
comparison purposes), obtained after solving 



using Matlab's built-in spline toolbox. Parameter fx was chosen based on leave-one-out cross-validation, 
and it is apparent that no cleansing of the load profile takes place. Indeed, the resulting fitted function 
follows very closely the training data, even during the abnormal energy peaks observed on the so-termed 
"building operational transition shoulder periods." 

Because with real load curve data the nominal noise variance cj^ in Q is unknown, selection of the 
tuning parameters {/i, Ai} in ( [29l ) requires a robust estimate of the variance o"^ such as the MAD [cf. 
Section ITlI-BH . Similar to 161, it is assumed that the nominal errors are zero mean Gaussian distributed, so 
that ( [201 ) can be applied yielding the value = 0.6964. To form the residuals in (|20] l. (|30] l is solved first 
using a small subset of T that comprises 126 measurements. A nonuniform grid of /i and Ai values is 
constructed, as described in Section IIII-B I Relevant parameters are = 100, Gx = 200, /imin = 10^'^, 
A'max = 10, and e = 10~^. The robustification paths (one per fi value in the grid) were obtained using 
the SpaRSA toolbox in BTI . with the sample variance matrix X] formed as in (ITSl) . The optimum tuning 



the time, it still does not yield a cleansed estimate /. This should be contrasted with the estimator (|29] l. 




(30) 
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parameters fj,* = 1.637 and = 3.6841 are finally determined based on the criterion (1T9) . where the 
unknown is replaced with a^. Finally, the cleansed load curve is refined by running four iterations 
of (l24l ) as described in Section JVl with a value of 6 = 10~^. Results are depicted in Fig. |2l where the 
cleansed load curves are superimposed to the training data in T. Red circles indicate those data points 
deemed as outliers, information that is readily obtained from the support of 6. By inspection of Fig. |7J 
it is apparent that the proposed sparsity-controlling estimator has the desired cleansing capability. The 
cleansed load curves closely follow the tiaining data, but are smooth enough to avoid overfitting the 
abnormal energy peaks on the "shoulders." Indeed, these peaks are in most cases identified as outliers. As 
seen from Fig. |7](a), the solution of (|29l ) tends to overestimate the support of o, since one could argue that 
some of the red circles in Fig. |7](a) do not correspond to outliers. Again, the nonconvex regularization in 
Section |IV] prunes the outlier support obtained via (|29] l. resulting in a more accurate result and reducing 
the number of outliers identified from 77 to 41. 

VI. Concluding Summary 

Outlier-robust nonparametric regression methods were developed in this paper for function approxima- 
tion in RKHS. Building on a neat link between the seemingly unrelated fields of robust statistics and 
sparse regression, the novel estimators were found rooted at the crossroads of outlier-resilient estimation, 
the Lasso, and convex optimization. Estimators as fundamental as LS for linear regression, regularization 
networks, and (thin-plate) smoothing splines, can be robustified under the proposed framework. 

Training samples from the (unknown) target function were assumed generated from a regression model, 
which explicitly incorporates an unknown sparse vector of outliers. To fit such a model, the proposed 
variational estimator minimizes a tradeoff between fidelity to the training data, the degree of "smoothness" 
of the regression function, and the sparsity level of the vector of outliers. While model complexity control 
effected through a smoothing penalty has quite well understood ramifications in terms of generalization 
capability, the major innovative claim here is that sparsity control is tantamount to robustness control. 
This is indeed the case since a tunable parameter in a Lasso reformulation of the variational estimator, 
controls the degree of sparsity in the estimated vector of model outliers. Selection of tuning parameters 
could be at first thought as a mundane task. However, arguing on the importance of such task in the 
context of robust nonparametric regression, as well as devising principled methods to effectively carry out 
smoothness and sparsity control, are at the heart of this paper's novelty. Sparsity control can be carried 
out at affordable complexity, by capitalizing on state-of-the-art algorithms that can efficiently compute the 
whole path of Lasso solutions. In this sense, the method here capitalizes on but is not limited to sparse 
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settings where few outliers are present, since one can efficiently examine the gamut of sparsity levels 
along the robustification path. Computer simulations have shown that the novel methods of this paper 
outperform existing alternatives including SVR, and one if its robust variants. 

As an application domain relevant to robust nonparametric regression, the problem of load curve 
cleansing for power systems engineering was also considered along with a solution proposed based on 
robust cubic spline smoothing. Numerical tests on real load curve data demonstrated that the smoothness 
and sparsity controlling methods of this paper are effective in cleansing load profiles, without user 
intervention to aid the learning process. 
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Appendix 

Towards establishing the equivalence between problems ^ and dSjl, consider the pair {/, 6} that solves 
(|7]l. Assume that / is given, and the goal is to determine 6. Upon defining the residuals fj := in — f{'x.i) 
and because ||o||i = Xli^i the entries of 6 are separately given by 

Oj := arg mill ["(rj - Oj)^ + All Oil] , i = l,...,N, (31) 

where the term /x||/|||^ in ([7]) has been omitted, since it is inconsequential for the minimization with 
respect to o. For each i = 1, . . . , N, because ( [3T| ) is nondifferentiable at the origin one should consider 
three cases: i) if 6j = 0, it follows that the minimum cost in (|3TI ) is ff; ii) if 6j > 0, the first-order 
condition for optimality gives Oj = fj — Ai/2 provided fj > Ai/2, and the minimum cost is Aifj — Af/4; 
otherwise, iii) if 6j < 0, it follows that Oj = f j + Ai/2 provided < — Ai/2, and the minimum cost is 
— Aifj — Af/4. In other words, 

Ti - \i/2, h > Ai/2 

0, |fi|<Ai/2 , i = l,...,N. (32) 

^ fi + Ai/2, r, < -Ai/2 
Upon plugging (l32l ) into (|3T] ). the minimum cost in (|3TI ) after minimizing with respect to Oj is p{fi) [cf. 
Q and the argument preceding (|32]|1. All in all, the conclusion is that / is the minimizer of dD - in 
addition to being the solution of ^ by definition - completing the proof. ■ 
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Fig. 1. Example of load curve data with outliers. 




Fig. 2. True Gaussian mixture function /o(x), and its 180 noisy samples taken over [0,3] x [0,3] shown as black dots. The 
red dots indicate the No = 20 outliers in the training data set T. The green points indicate the predicted responses yi at the 
sampling points Xi, from the estimate / obtained after solving i25\ . Note how all green points are close to the surface fo- 
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Fig. 3. Robustification path with optimum smoothing parameter /i* — 1.55 x 10^"^. The data is con'upted with No — 20 outliers. 
The coefficients 6i corresponding to the outliers are shown in red, while the rest are shown in blue. The vertical line indicates 
the selection of = 3.83 x 10^^, and shows that the outliers were coiTectly identified. 




Fig. 4. Robust estimation of a Gaussian mixture using thin-plate splines. The data is corrupted with No = 20 outliers, (a) True 
function /o(x); (b) nonrobust predicted function obtained after solving ilSi ; (c) predicted function after solving ( I25t with the 
optimum tuning parameters; (d) refined predicted function using the nonconvex regularization in ( I2U . 
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(g) (h) 

Fig. 5. Robust estimation of the sine function. Tfie data is comipted with No = 3 outliers, and the nominal noise variance is 
— 1 X 10^*. (a) Noisy training data and outliers; (b) predicted values obtained after solving ([T} with V{u) — it^; (c) SVR 
predictions for e — 0.1; (d) RSVR predictions for e = 0.1; (e) SVR predictions for e — 0.01; (f) RSVR predictions for e = 0.01; 
(g) predicted values obtained after solving (|7}; (h) refined predictions using the nonconvex regularization in ( 121) . 
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Fig. 6. Load curve data cleansing, (a) Noisy training data and outliers; (b) fitted load profile obtained after solving 
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Fig. 7. Load curve data cleansing, (a) Cleansed load profile obtained after solving i29l : (b) refined load profile obtained after 
using the nonconvex regularization in ill) . 



